In this guided practice, we will use the 2015 Home Mortgage Disclosure Act (HMDA) data for New York, compiled and provided by the Consumer Financial Protection Bureau. This dataset covers all mortgage decisions made in the state of New York during 2015.
data_selection <- readRDS('./data/asec/sample_mortgage.RDS')
data_selection
As you can see, this dataset contains demographic and socioeconomic information about the people who applied for loans and the neighborhoods in which they live. Our goal for this guided practice is to build KNN and Naive Bayes models that predict whether a person will receive a loan, and to explore how these models contain an explicit or implicit racial bias.
First, we are going to set up the recipe and workflow for our models.
Our dataset has already been preprocessed, so we don’t need many steps in the recipe for the workflow. We are only going to set the formula action_taken_name ~ ., since we want to use all our variables to predict the action taken on the mortgage, and we are going to update the role of the respondent_id column so that it serves as the id column for the model.
recipe <- recipe(action_taken_name ~ ., data = data_selection) %>%
  update_role(respondent_id, new_role = "id")
We are going to create the workflow and apply the recipe to it.
Remember that the idea behind tidymodels is that the whole
modeling process can be contained in a workflow, where we can load,
update, and extract parts from different trained models.
wf <- workflow() %>%
  add_recipe(recipe)
Next, we are going to create the Naive Bayes model we want to train
in the workflow. tidymodels only contains the
classification models included in the parsnip package,
which don’t include Naive Bayes. So for this kind of model we have to
load the discrim library.
library(discrim)
nb_spec <- naive_Bayes() %>%
  set_mode("classification") %>%
  set_engine("naivebayes")
nb_spec
## Naive Bayes Model Specification (classification)
##
## Computational engine: naivebayes
Next, we add the model to the workflow with
add_model.
wf <- wf %>% add_model(nb_spec)
And we fit it to our dataset. In the context of machine learning, fitting refers to the process of training the algorithm on a set of data, so that the model can make accurate predictions on new, unseen data.
nb_fit <- wf %>% fit(data_selection)
nb_fit
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: naive_Bayes()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 0 Recipe Steps
##
## ── Model ───────────────────────────────────────────────────────────────────────
##
## ================================== Naive Bayes ==================================
##
## Call:
## naive_bayes.default(x = maybe_data_frame(x), y = y, usekernel = TRUE)
##
## ---------------------------------------------------------------------------------
##
## Laplace smoothing: 0
##
## ---------------------------------------------------------------------------------
##
## A priori probabilities:
##
## Loan not originated Loan originated
## 0.3373042 0.6626958
##
## ---------------------------------------------------------------------------------
##
## Tables:
##
## ---------------------------------------------------------------------------------
## ::: agency_name (Categorical)
## ---------------------------------------------------------------------------------
##
## agency_name Loan not originated
## Consumer Financial Protection Bureau 0.41630013
## Department of Housing and Urban Development 0.37787840
## Federal Deposit Insurance Corporation 0.03208279
## Federal Reserve System 0.02108668
## National Credit Union Administration 0.09805951
## Office of the Comptroller of the Currency 0.05459250
##
## agency_name Loan originated
## Consumer Financial Protection Bureau 0.42338842
## Department of Housing and Urban Development 0.33232370
## Federal Deposit Insurance Corporation 0.03331797
## Federal Reserve System 0.02047804
## National Credit Union Administration 0.11951011
## Office of the Comptroller of the Currency 0.07098176
##
## ---------------------------------------------------------------------------------
## ::: applicant_ethnicity_name (Categorical)
## ---------------------------------------------------------------------------------
##
## applicant_ethnicity_name Loan not originated
## Hispanic or Latino 0.08551100
## Information not provided by applicant in mail, Internet, or telephone application 0.14812419
## Not applicable 0.01319534
## Not Hispanic or Latino 0.75316947
##
## ...
## and 69 more lines.
As you can see from the output, the model was trained and fitted. It returns the a priori probabilities of the target classes, along with a conditional probability table for each predictor. We can explore the table for any one predictor like this:
extract_fit_engine(nb_fit)$tables$applicant_race_name_1
##
## applicant_race_name_1 Loan not originated
## American Indian or Alaska Native 0.006597671
## Asian 0.077231565
## Black or African American 0.101164295
## Information not provided by applicant in mail, Internet, or telephone application 0.163130660
## Native Hawaiian or Other Pacific Islander 0.004010349
## Not applicable 0.012031048
## White 0.635834411
##
## applicant_race_name_1 Loan originated
## American Indian or Alaska Native 0.002041219
## Asian 0.067228551
## Black or African American 0.052479094
## Information not provided by applicant in mail, Internet, or telephone application 0.080529400
## Native Hawaiian or Other Pacific Islander 0.001975374
## Not applicable 0.176400869
## White 0.619345493
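To make the mechanics concrete, we can reproduce a single (simplified) Naive Bayes prediction by hand. The sketch below uses only the a priori probabilities and the applicant_race_name_1 table printed above, ignoring every other predictor, so the numbers illustrate the calculation rather than the full model’s output.

```r
# A priori probabilities, copied from the fitted model's output
prior <- c(not_originated = 0.3373042, originated = 0.6626958)

# P(applicant_race_name_1 = "Black or African American" | class),
# copied from the conditional probability table above
likelihood <- c(not_originated = 0.101164295, originated = 0.052479094)

# Bayes' rule: posterior is proportional to prior * likelihood,
# then we normalize so the two classes sum to 1
unnormalized <- prior * likelihood
posterior <- unnormalized / sum(unnormalized)
round(posterior, 3)
# not_originated: 0.495, originated: 0.505
```

With only this one predictor the two classes end up almost tied: the lower conditional probability of originated loans for Black applicants nearly cancels out the higher prior for origination. The full model multiplies in the likelihoods of every other predictor as well.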
Now, how does this model perform when predicting the data? First, we are going to generate a test_val dataframe that has a column with the model’s predictions.
test_val <- nb_fit %>%
  predict(data_selection) %>%
  bind_cols(data_selection) %>%
  rename(pred_nb = .pred_class)
Since we are not getting into evaluation metrics quite yet, we will explore the results by comparing the proportion of predicted values within each real category.
test_val %>%
  group_by(action_taken_name, pred_nb) %>%
  summarize(n = n()) %>%
  ggplot() +
  geom_col(aes(x = action_taken_name, y = n, fill = pred_nb),
           position = "fill") +
  theme_minimal()
Naive Bayes doesn’t perform so badly in this case. Around 73% of the not originated loans were predicted correctly, and around 87% of the originated loans were predicted as such.
Now, we are going to try to predict the same variable with a KNN classifier. First, we load the kknn library in order to instantiate this type of model. Second, we create the model with the function nearest_neighbor(). We set the parameter neighbors, which defines the number of neighbors the model uses to make its predictions. For this example, we will set it to 5 neighbors.
library(kknn)
knn_spec <- nearest_neighbor(neighbors = 5) %>%
  set_engine("kknn") %>%
  set_mode("classification")
Now, we will fit the model to our data.
knn_fit <- wf %>%
  update_model(knn_spec) %>%
  fit(data_selection)
And predict the values on the dataset.
test_val <- knn_fit %>%
  predict(data_selection) %>%
  bind_cols(test_val) %>%
  rename(pred_knn = .pred_class)
How did the model perform?
test_val %>%
  group_by(action_taken_name, pred_knn) %>%
  summarize(n = n()) %>%
  ggplot() +
  geom_col(aes(x = action_taken_name, y = n, fill = pred_knn),
           position = "fill") +
  theme_minimal()
write_rds(knn_fit, './models/knn_fit.RDS')
Apparently, this model predicted every case perfectly. That would be great if we were trying the model out on data it wasn’t trained on. However, we are predicting on the same dataset we fitted the model to. Also, the model classifies each case based on only 5 nearest neighbors, which is quite a small number. This means the model can capture our dataset’s categories almost perfectly, but it is far too specific to the cases we have and won’t predict new cases well. What happens if we use K = 100? First, we set the parameter…
knn_spec_100 <- nearest_neighbor(neighbors = 100) %>%
  set_engine("kknn") %>%
  set_mode("classification")
And then, we refit and predict the values.
knn_fit_100 <- wf %>%
  update_model(knn_spec_100) %>%
  fit(data_selection)

test_val <- knn_fit_100 %>%
  predict(data_selection) %>%
  bind_cols(test_val) %>%
  rename(pred_knn_100 = .pred_class)
And we visualize the results.
test_val %>%
  group_by(action_taken_name, pred_knn_100) %>%
  summarize(n = n()) %>%
  ggplot() +
  geom_col(aes(x = action_taken_name, y = n, fill = pred_knn_100),
           position = "fill") +
  theme_minimal()
Now, the model doesn’t predict the cases perfectly, but it is a much more appropriate model to apply in new scenarios! It even improves on Naive Bayes’ performance, correctly capturing 76% of the not originated loans and 88% of the originated loans.
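To build intuition for why K matters, here is a minimal majority-vote KNN on made-up one-dimensional data (the toy points and the helper function are invented for illustration; this is not the kknn engine). With K = 1, a single mislabeled training point reproduces its own noise, while a larger K averages it away.

```r
# Toy training data: class "a" clusters near 0, class "b" near 10,
# with one mislabeled "b" point planted inside the "a" cluster
x     <- c(0, 1, 2, 3, 9, 10, 11, 2.1)
label <- c("a", "a", "a", "a", "b", "b", "b", "b")  # last point is noise

# Predict the class of a query point by majority vote among its K nearest neighbors
knn_predict <- function(query, k) {
  nearest <- order(abs(x - query))[1:k]  # indices of the k closest points
  votes <- table(label[nearest])
  names(votes)[which.max(votes)]
}

knn_predict(2.09, k = 1)  # "b": parrots the single mislabeled neighbor
knn_predict(2.09, k = 5)  # "a": the majority vote smooths out the noise
```

This is the same trade-off we saw above: K = 5 on our full dataset was small enough to memorize its training cases, while K = 100 generalizes better.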
The dataset used to predict the model’s target variable contains several categories referring to the applicant’s race and ethnicity. In real life, we know that a person’s race can make them a more (or less) favorable candidate to receive a loan. So, what happens when a computer learns from a racially-motivated premise to give out mortgages?
We are going to explore these results in our models by comparing the predicted target values for white and non-white applicants. First, we are going to create an ad-hoc variable with the applicants’ races grouped into those two categories.
test_val <- test_val %>%
  mutate(
    race_grouped = case_when(
      applicant_race_name_1 == "White" ~ "White",
      applicant_race_name_1 %in% c("Information not provided by applicant in mail, Internet, or telephone application", "Not applicable") ~ "Information not supplied",
      TRUE ~ "Non-white"
    )
  )
Now, we are going to see the proportion of correctly classified cases in the different groups, and compare each model.
results_nb <- test_val %>%
  filter(race_grouped != "Information not supplied") %>%
  group_by(race_grouped, action_taken_name, pred_nb) %>%
  summarize(n = n()) %>%
  mutate(perc = round(n / sum(n) * 100, 2))
## `summarise()` has grouped output by 'race_grouped', 'action_taken_name'. You
## can override using the `.groups` argument.
results_knn <- test_val %>%
  filter(race_grouped != "Information not supplied") %>%
  group_by(race_grouped, action_taken_name, pred_knn_100) %>%
  summarize(n = n()) %>%
  mutate(perc = round(n / sum(n) * 100, 2))
## `summarise()` has grouped output by 'race_grouped', 'action_taken_name'. You
## can override using the `.groups` argument.
p1 <- ggplot(results_nb, aes(x = pred_nb, y = action_taken_name)) +
  geom_tile(fill = "white", color = "black") +
  geom_text(aes(label = perc)) +
  scale_fill_viridis_c() +
  facet_wrap(vars(race_grouped),
             nrow = 2,
             scales = "free") +
  labs(title = "Naive Bayes classifier",
       x = "Prediction",
       y = "Real value") +
  theme_minimal()
p2 <- ggplot(results_knn, aes(x = pred_knn_100, y = action_taken_name)) +
  geom_tile(fill = "white", color = "black") +
  geom_text(aes(label = perc)) +
  scale_fill_viridis_c() +
  facet_wrap(vars(race_grouped),
             nrow = 2,
             scales = "free") +
  labs(title = "KNN classifier",
       x = "Prediction",
       y = "Real value") +
  theme_minimal()
library(patchwork)
p1 / p2
When we compare the predictions of both models across the groups, we can see that the proportion of correctly classified cases varies. For white applicants, both models show a higher classification error on not originated loans: they are biased towards classifying white people into the “Loan originated” category, which is to say they tend to grant loans to white people. The pattern for non-white applicants is different. In the KNN classifier, both categories have similar classification errors, but Naive Bayes shows a higher misclassification proportion for non-white applicants: they tend to be pushed towards the “Loan not originated” category. This shows that the model is racially biased: it tends to deny loans to non-white people.
Now, could this be avoided if we removed the category
applicant_race_name_1 from the model? Let’s try it out. We
must update the recipe to remove certain variables, and refit the
model.
new_recipe <- recipe(action_taken_name ~ ., data = data_selection) %>%
  update_role(respondent_id, new_role = "id") %>%
  step_rm(applicant_race_name_1, applicant_ethnicity_name)
new_wf <- wf %>%
  update_recipe(new_recipe)

new_nb_fit <- new_wf %>%
  update_model(nb_spec) %>%
  fit(data_selection)

new_knn_fit <- new_wf %>%
  update_model(knn_spec_100) %>%
  fit(data_selection)
Now, we predict with the new model on the dataframe and evaluate its bias following the same steps we did before.
test_val <- new_nb_fit %>%
  predict(test_val) %>%
  bind_cols(test_val) %>%
  rename(pred_nb_new = .pred_class)

results_nb_new <- test_val %>%
  filter(race_grouped != "Information not supplied") %>%
  group_by(race_grouped, action_taken_name, pred_nb_new) %>%
  summarize(n = n()) %>%
  mutate(perc = round(n / sum(n) * 100, 2))

test_val <- new_knn_fit %>%
  predict(test_val) %>%
  bind_cols(test_val) %>%
  rename(pred_knn_new = .pred_class)

results_knn_new <- test_val %>%
  filter(race_grouped != "Information not supplied") %>%
  group_by(race_grouped, action_taken_name, pred_knn_new) %>%
  summarize(n = n()) %>%
  mutate(perc = round(n / sum(n) * 100, 2))
p3 <- ggplot(results_nb_new, aes(x = pred_nb_new, y = action_taken_name)) +
  geom_tile(fill = "white", color = "black") +
  geom_text(aes(label = perc)) +
  scale_fill_viridis_c() +
  facet_wrap(vars(race_grouped),
             nrow = 2,
             scales = "free") +
  labs(title = "Naive Bayes classifier",
       x = "Prediction",
       y = "Real value") +
  theme_minimal()
p4 <- ggplot(results_knn_new, aes(x = pred_knn_new, y = action_taken_name)) +
  geom_tile(fill = "white", color = "black") +
  geom_text(aes(label = perc)) +
  scale_fill_viridis_c() +
  facet_wrap(vars(race_grouped),
             nrow = 2,
             scales = "free") +
  labs(title = "KNN classifier",
       x = "Prediction",
       y = "Real value") +
  theme_minimal()
p3 / p4
While this reduces the bias in the model, it doesn’t erase it completely. This is because other variables can be associated with the applicants’ race, so removing these two columns alone does not eliminate the model’s bias.
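One way to make this concrete is with a small simulation (everything below, including the variable names and proportions, is made up for illustration). If a remaining predictor, say, the applicant’s census tract, is itself correlated with race, then a model that sees the tract still “sees” race indirectly. A chi-squared test of independence exposes such an association.

```r
set.seed(42)
n <- 2000

# Simulated race variable (hypothetical)
race <- sample(c("White", "Non-white"), n, replace = TRUE)

# Simulated neighborhood that is correlated with race:
# white applicants land in Tract A 80% of the time, non-white only 30%
tract <- ifelse(
  runif(n) < ifelse(race == "White", 0.8, 0.3),
  "Tract A", "Tract B"
)

# A tiny p-value means tract and race are strongly associated,
# so tract carries racial information even if race itself is
# dropped from the model
chisq.test(table(race, tract))
```

Running this kind of check on each remaining predictor against race would reveal which variables act as proxies, and it is those proxies that keep the bias alive after applicant_race_name_1 and applicant_ethnicity_name are removed.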